Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest parallel computers being built and planned today. As processor counts continue to grow, the cost of ensuring reliability consistently throughout an application will become unbearable. However, many algorithms only need reliability for certain data and phases of computation. This suggests an algorithm and system codesign approach. We show that if the system lets applications apply reliability selectively, we can develop algorithms that compute the right answer despite faults. These “fault-tolerant” iterative methods either converge eventually, at a rate that degr...
This paper continues to develop a fault tolerant extension of the sparse grid combination technique ...
We present a fault model designed to bring out the “worst” in iterative solvers based on mathematica...
As large-scale linear equation systems are pervasive in many scientific fields, great efforts have b...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
ELLIOTT III, JAMES JOHN. Resilient Iterative Linear Solvers Running Through Errors. (Under the direc...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
Several recovery techniques for parallel iterative methods are presented. First, the implementation ...
As late-CMOS process scaling leads to increasingly variable circuits/logic and as most post-CMOS tec...
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific application...
Some of today’s applications run on computer platforms with large and inexpensive memories, which ar...
As hardware devices like processor cores and memory sub-systems based on nano-scale technol...
As the number of processors in today’s parallel systems continues to grow, the mean-time-to-failure ...